
17.3. Optimization

AI has a large overlap with the general area of optimization because the search for the desired model is a process of identifying the model that best meets the goal. In some algorithms, such models can be derived analytically. Such derivations can also be counted as optimization because they essentially use exact optimization methods; examples include linear models and support vector machines, which rely on the linear least squares method and the method of Lagrangian multipliers, respectively. Though many people do not treat these efforts as optimization, most would agree that the gradient descent methods in deep learning, which involve an iterative process for identifying the optimal model, touch the core of optimization. But no matter how we view this overlap, there is a consensus that optimization tools, or at least some optimization methods, are an essential part of modern AI predominated by machine learning. In typical optimization problems, we try to maximize or minimize an objective function, $f(\vec{x})$, where $\vec{x}$ is a vector of continuous/discrete values.
The classification of optimization methods can be rather diverse, depending on criteria including but not limited to the consideration of uncertainty, the continuity of the parameters to be optimized, the continuity and differentiability of the objective functions, the search for global vs. local optima, the existence of constraints, and the number of objective functions. As mentioned above, optimization methods in machine learning can be roughly classified as exact methods (linear programming, quadratic programming, dynamic programming, branch and bound), general approximate (heuristic) methods (Hill Climbing, Tabu Search, Simulated Annealing, Evolutionary Algorithms (including the Genetic Algorithm), and Swarm Intelligence Algorithms (Ant Colony, Honeybee Mating, Particle Swarm)), gradient-based (first-order) methods (GD, BGD, MBGD, SGD, Nesterov, Momentum, AdaGrad, Adam, AdaDelta, RMSProp), and Newton's and quasi-Newton (second-order) methods (including BFGS). The conjugate direction method and its nonlinear variants can be regarded as a hybrid between the gradient-based methods (first-order methods that use the steepest descent direction) and Newton's method (a second-order method that also uses the Hessian). Another major category of optimization methods in AI, the Expectation Maximization (EM) algorithms, can be roughly viewed as a category of more general approximate methods.
Exact methods, gradient-based methods, and EM are the three main categories of optimization methods in contemporary AI. In this book, linear least squares, Lagrangian multipliers, and gradient-based methods have been discussed in the chapters on linear models, SVM, and deep learning, respectively. Newton's and quasi-Newton methods have gained popularity in deep learning but have not yet been discussed, so some information is given in this section. Conjugate gradient methods, which are very useful but have not been widely adopted in AI, e.g., in deep learning, for a variety of reasons, will also be discussed considering their wide applicability and possible use in the future of AI. EM has many applications in machine learning, including Gaussian mixtures, clustering algorithms such as K-means, and HMMs, but it has not been systematically discussed in the previous chapters. Thus, EM will also be explained in the following.

17.3.1. Gradient-Based Methods

Gradient descent is the predominant category of optimization methods in deep learning and is thus a must-know. The basic algorithm is Batch Gradient Descent (BGD), which uses the gradients of the whole batch (all the training samples) for each update. Mini-Batch Gradient Descent (MBGD) is the variant we most frequently use: we draw a portion of the samples, i.e., a mini-batch, from the whole batch and update the model parameters using the gradients calculated with this mini-batch. This mini-batch is what the parameter "batch size" in most deep learning software actually refers to. The use of such mini-batches introduces stochasticity, which sometimes helps the optimizer jump out of local minima. Following this direction to the extreme of stochasticity, we reach Stochastic Gradient Descent (SGD), which uses one sample randomly drawn from the whole batch to compute the gradients for updating the model.
Most of the optimizers that we use in deep learning are variations of MBGD. The Momentum method includes the gradients of the previous update step in addition to the gradients of the current step, which helps the optimizer escape local optima or saddle points. Nesterov momentum uses the gradients at a "future" point for the update and can thus be viewed as a modification of the Momentum method. AdaGrad was proposed to automatically adjust the learning rate. AdaDelta was proposed to address three issues with AdaGrad: 1) the monotonically decreasing learning rate, which can become excessively small in late training stages, 2) the inconsistent units on the two sides of the update equation, and 3) the manually set initial learning rate. As a variation and special case of AdaDelta, RMSProp sits between AdaGrad and AdaDelta. Adaptive Moment Estimation (Adam) is essentially RMSProp with momentum and computes adaptive learning rates. Nadam can be viewed as an Adam variant with Nesterov momentum. More details are available in the chapter on deep learning.
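To make the mini-batch idea concrete, below is a minimal sketch of mini-batch gradient descent with classical momentum on a small synthetic least-squares problem. The data, the hyper-parameter values, and the variable names are illustrative assumptions, not code from any particular deep learning framework.

```python
import numpy as np

# Synthetic linear-regression data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([1.5, -2.0, 0.7])
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(3)          # model parameters to be learned
v = np.zeros(3)          # momentum ("velocity") buffer
alpha, beta = 0.05, 0.9  # learning rate and momentum coefficient
batch_size = 32

for epoch in range(20):
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]          # the mini-batch
        Xb, yb = X[batch], y[batch]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(batch)   # mini-batch gradient of MSE
        v = beta * v + grad                            # momentum accumulation
        w = w - alpha * v                              # parameter update

print(w)  # should be close to true_w
```

Setting `batch_size = len(X)` recovers BGD, and `batch_size = 1` recovers SGD; dropping the momentum buffer (`beta = 0`) gives plain MBGD.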
Gradient-based methods are usually explained using the idea of hill climbing (in fact, "valley descending"): we move along the direction of fastest descent towards the bottom of a valley. Alternatively, we can understand gradient-based methods as a process of approximating the true objective function with an approximate function and then using the minimum point of this approximate function as the next point to move to in an iterative process.
Figure 17.1: Approximation for optimization
As illustrated in Fig. 17.1, we use a function $g_{n}(x)$ to approximate the objective function $f(x)$, whose exact form is unknown and can be complex. Since we know the approximate function, its minimum value and the corresponding location can be conveniently identified. As we move to the next iterative step, we will use $g_{n+1}(x)$ for the approximation. $g_{n}(x)$ and $g_{n+1}(x)$ usually have the same mathematical form but include terms embedding local information, e.g., the values and gradients at the current step. The approximate function in gradient-based methods is as follows.
\begin{equation*}
g(x)=f\left(x_{n}\right)+f^{\prime}\left(x_{n}\right)\left(x-x_{n}\right)+\frac{1}{2 \alpha}\left(x-x_{n}\right)^{2} \tag{17.89}
\end{equation*}
This curve depicts a second-order polynomial that is concave upward (assuming $\alpha>0$). So, we can easily identify the minimum, which occurs where $g^{\prime}(x)=0$. This gives us the location of the minimum value:
\begin{equation*}
x=x_{n}-\alpha f^{\prime}\left(x_{n}\right) \tag{17.90}
\end{equation*}
where $\alpha$ is the well-known hyper-parameter called the learning rate or learning step size, which is critical to optimization.
Therefore, we can obtain the following equation for update.
\begin{equation*}
x_{n+1}=x_{n}-\alpha f^{\prime}\left(x_{n}\right) \tag{17.91}
\end{equation*}
In higher-dimensional spaces, i.e., when $\vec{x}$ has multiple entries (attributes), this equation becomes the general update equation for gradient-based methods:
\begin{equation*}
\vec{x}_{n+1}=\vec{x}_{n}-\alpha \nabla f\left(\vec{x}_{n}\right) \tag{17.92}
\end{equation*}
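As a sanity check on Eq. 17.92, the following is a minimal sketch of plain gradient descent on a simple quadratic bowl. The objective, the starting point, and the learning rate are illustrative assumptions.

```python
import numpy as np

def f(x):
    """Example objective: a quadratic bowl with its minimum at (1, -2)."""
    return (x[0] - 1.0) ** 2 + 2.0 * (x[1] + 2.0) ** 2

def grad_f(x):
    """Analytical gradient of f."""
    return np.array([2.0 * (x[0] - 1.0), 4.0 * (x[1] + 2.0)])

x = np.array([5.0, 5.0])   # starting point x_0
alpha = 0.1                # learning rate

for n in range(100):
    x = x - alpha * grad_f(x)   # Eq. 17.92: x_{n+1} = x_n - alpha * grad f(x_n)

print(x, f(x))  # x should approach (1, -2) and f(x) should approach 0
```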
The same idea can be used for optimization with different approximate functions. In the following, we will see that the above equation can be slightly modified by including a second-order gradient to reach Newton's method. In fact, the selection of more general approximate functions that do not necessarily include any gradients, or that even lack an explicit mathematical form, leads to the idea of the Expectation Maximization methods introduced at the end of this section. The Conjugate Gradient method and its variations, as an intermediate between gradient-based methods and Newton's method, use an approximate function with an implicit relationship between $f^{\prime}(x)$ and $\alpha$, which needs to be obtained in an iterative way.

17.3.2. Newton's Method and Quasi-Newton Methods

Using the same approximation concept, we adopt the following equation from Taylor's expansion to approximate the objective function:
\begin{equation*}
g(x)=f\left(x_{n}\right)+f^{\prime}\left(x_{n}\right)\left(x-x_{n}\right)+\frac{1}{2} f^{\prime \prime}\left(x_{n}\right)\left(x-x_{n}\right)^{2} \tag{17.93}
\end{equation*}
Setting $g^{\prime}(x)=0$ for this approximation, we obtain its minimum at the following point:
\begin{equation*}
x=x_{n}-\frac{f^{\prime}\left(x_{n}\right)}{f^{\prime \prime}\left(x_{n}\right)} \tag{17.94}
\end{equation*}
This yields the following update equation:
\begin{equation*}
x_{n+1}=x_{n}-\frac{f^{\prime}\left(x_{n}\right)}{f^{\prime \prime}\left(x_{n}\right)} \tag{17.95}
\end{equation*}
Newton's method can help us quickly find a global minimum if $f(x)$ is sufficiently smooth and well-behaved (e.g., convex). However, the actual objective function is usually much more complicated than that. If the computational demand is not a concern, Newton's method is, in general, much better than gradient-based methods. However, the calculation of the second-order gradients, i.e., the Hessian matrix, can be very computationally expensive. To address this issue, many quasi-Newton methods, such as BFGS, the SR1 formula, and BHHH, have been proposed to approximate the Hessian matrix.
It is worthwhile to mention that Newton's method has also been widely used for solving equations, and equation solving and optimization are closely related. The equation-solving problem corresponding to optimization is $f^{\prime}(x)=0$. That is, the search for the minimum of $f(x)$ is the same problem as solving the equation $f^{\prime}(x)=0$.
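A minimal 1-D sketch of the Newton update in Eq. 17.95 follows; note that it is the same as applying Newton's root-finding iteration to $f^{\prime}(x)=0$. The test function and the starting point are illustrative assumptions, and a multivariate version would replace $f^{\prime\prime}$ with the Hessian matrix.

```python
def f_prime(x):
    """First derivative of the example objective f(x) = x**4 - 3*x**2 + 2*x."""
    return 4 * x**3 - 6 * x + 2

def f_double_prime(x):
    """Second derivative of the same objective."""
    return 12 * x**2 - 6

x = 2.0  # starting point
for n in range(20):
    x = x - f_prime(x) / f_double_prime(x)  # Eq. 17.95

print(x, f_prime(x))  # f'(x) should be ~0 at the stationary point found (x = 1 here)
```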

17.3.3. Conjugate Gradient Methods

The Conjugate Gradient (CG) approach and its variations have been widely used for solving large-scale linear systems of equations and nonlinear optimization problems. First-order methods such as gradient-based methods converge slowly, while second-order methods such as Newton's method are resource-heavy. Conjugate gradient optimization is an intermediate algorithm: it uses only first-order information while approaching the convergence speed of higher-order methods. Similar to gradient-based methods and quasi-Newton methods, methods in the CG family are also approximate optimization methods, which use a function to approximate the objective function and search for the optimum using this approximate function. This can be proven, but the derivation is excluded here since it is not essential for understanding and using this type of method.
Compared with gradient-based methods, CG addresses one critical question: how to determine the learning step size. In gradient-based methods, such a step size is determined either manually based on experience or automatically using some algorithms. The determination of the step size is critical to the optimization speed and results in many cases and still needs to be handled carefully to ensure acceptable results. Newton's method and quasi-Newton methods use the second-order gradient to address this issue, which offers a more accurate but computationally expensive solution. CG does not need to resort to second-order gradients and thus circumvents the trouble of computing them.
To reach a general derivation, we again start from the following general update equation:
\begin{equation*}
\vec{x}_{i+1}=\vec{x}_{i}+\alpha_{i} \vec{r}_{i} \tag{17.96}
\end{equation*}
CG is mostly used for data spaces with two or more dimensions, i.e., the model coefficient vector $\vec{x}$ ($\vec{\theta}$ in other chapters; please be aware that $\vec{x}$ is not a sample here) has two or more parameters. Thus, we use $\vec{x}$ rather than $x$ as in the subsections for the other optimization methods.
The essence of CG can be understood as follows. Like regular gradient-based methods, CG also moves along the direction of fastest descent. Therefore, the step vector $\alpha_{i} \vec{r}_{i}$ is proportional to the negative gradient: $\vec{r}_{i}=-\frac{d f\left(\vec{x}_{i}\right)}{d \vec{x}_{i}}$ and $\vec{r}_{i+1}=-\frac{d f\left(\vec{x}_{i+1}\right)}{d \vec{x}_{i+1}}$. However, unlike the relatively "random" selection of $\alpha$ in gradient-based methods, CG determines the step size by searching for the lowest objective function value along this fastest-descent direction. We know that the lowest objective function value satisfies the following condition:
\begin{equation*}
\frac{d f\left(\vec{x}_{i+1}\right)}{d \alpha_{i}}=0 \tag{17.97}
\end{equation*}
where $\vec{x}_{i}$ is the state variable at the current step (or the model parameters $\vec{\theta}_{i}$ at the current optimization step) and thus $\vec{x}_{i+1}$ is the state (model) that we will reach if we move along the fastest-descent direction.
The following step is the key. We reformulate the above equation using the chain rule.
\begin{equation*}
\frac{d f\left(\vec{x}_{i+1}\right)}{d \vec{x}_{i+1}} \cdot \frac{d \vec{x}_{i+1}}{d \alpha_{i}}=-\vec{r}_{i+1} \cdot \vec{r}_{i}=0 \tag{17.98}
\end{equation*}
In the remaining derivation, we will use a quadratic optimization problem: minimizing the function $f(\vec{x})=\frac{1}{2} \vec{x}^{T} \cdot \bar{A} \cdot \vec{x}-\vec{b}^{T} \cdot \vec{x}+c$, which has a unique minimum when $\bar{A}$ is (symmetric) positive definite.
It is noted that this optimization problem is equivalent to solving the linear system $\bar{A}^{T} \cdot \vec{x}=\vec{b}$.
Next, let us take a look at two basic quantities defined for this quadratic problem.
Error:
\begin{equation*}
\vec{e}_{i}=\vec{x}_{i}-\vec{x} \tag{17.99}
\end{equation*}
Residual: since $\vec{x}_{i} \neq \vec{x}$, we have $\bar{A}^{T} \cdot \vec{x}_{i} \neq \vec{b}$, and the residual is defined as
\begin{equation*}
\vec{r}_{i}=\vec{b}-\bar{A}^{T} \cdot \vec{x}_{i} \tag{17.100}
\end{equation*}
We can derive
\begin{equation*}
\vec{r}_{i}=\left(\vec{b}-\bar{A}^{T} \cdot \vec{x}_{i}\right)-\left(\vec{b}-\bar{A}^{T} \cdot \vec{x}\right)=-\bar{A}^{T} \cdot\left(\vec{x}_{i}-\vec{x}\right)=-\bar{A} \cdot \vec{e}_{i} \tag{17.101}
\end{equation*}
The above equation shows the relationship between $\vec{r}_{i}$ and $\vec{e}_{i}$, which is very useful for understanding the problem. In practice, however, we use $-\frac{d f\left(\vec{x}_{i}\right)}{d \vec{x}_{i}}$ to calculate $\vec{r}_{i}$, because $\vec{e}_{i}$ cannot be calculated: the true optimal point $\vec{x}$ is unknown.
Next, let us show how to derive the formulation for calculating α α alpha\alphaα :
\begin{align*}
\vec{r}_{i+1}^{T} \cdot \vec{r}_{i} & =\left(\vec{b}-\bar{A}^{T} \cdot \vec{x}_{i+1}\right)^{T} \cdot \vec{r}_{i} \\
& =\left[\vec{b}-\bar{A}^{T} \cdot\left(\vec{x}_{i}+\alpha_{i} \vec{r}_{i}\right)\right]^{T} \cdot \vec{r}_{i} \tag{17.102}\\
& =\left(\vec{b}-\bar{A}^{T} \cdot \vec{x}_{i}\right)^{T} \cdot \vec{r}_{i}-\alpha_{i}\left(\bar{A}^{T} \cdot \vec{r}_{i}\right)^{T} \cdot \vec{r}_{i} \\
& =\vec{r}_{i}^{T} \cdot \vec{r}_{i}-\alpha_{i}\left(\bar{A}^{T} \cdot \vec{r}_{i}\right)^{T} \cdot \vec{r}_{i}=0
\end{align*}
Then, we can obtain the following equation for calculating $\alpha_{i}$:
\begin{equation*}
\alpha_{i}=\frac{\vec{r}_{i}^{T} \cdot \vec{r}_{i}}{\vec{r}_{i}^{T} \cdot \bar{A}^{T} \cdot \vec{r}_{i}} \tag{17.103}
\end{equation*}
CG can be implemented with different iterative procedures, leading to different CG variations. One intuitive way is to use the following three equations:
\begin{align*}
& \vec{r}_{i}=\vec{b}-\bar{A}^{T} \cdot \vec{x}_{i} \tag{17.104}\\
& \alpha_{i}=\frac{\vec{r}_{i}^{T} \cdot \vec{r}_{i}}{\vec{r}_{i}^{T} \cdot \bar{A}^{T} \cdot \vec{r}_{i}} \tag{17.105}\\
& \vec{x}_{i+1}=\vec{x}_{i}+\alpha_{i} \vec{r}_{i} \tag{17.106}
\end{align*}
Alternatively, we can multiply (dot product) both sides of Eq. 17.106 by $-\bar{A}$ and use the relationship $\vec{r}_{i}=-\bar{A}\left(\vec{x}_{i}-\vec{x}\right)$, yielding
\begin{equation*}
\vec{r}_{i+1}=\vec{r}_{i}-\alpha_{i} \bar{A} \cdot \vec{r}_{i} \tag{17.107}
\end{equation*}
This recurrence can be used together with Eq. 17.105 for the update. However, it is worthwhile to mention that it accumulates more numerical error. Recomputing $\vec{r}_{i}$ from $\vec{x}_{i}$ (Eq. 17.104) every so often can help alleviate this issue.
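Below is a minimal NumPy sketch of the iteration defined by Eqs. 17.104-17.106 on a small symmetric positive definite system; the matrix, right-hand side, iteration count, and tolerance are illustrative assumptions. Note that this simple form always steps along the current residual; full CG implementations additionally combine successive residuals into mutually conjugate search directions.

```python
import numpy as np

# Illustrative symmetric positive definite system A x = b.
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

x = np.zeros(2)                      # initial guess x_0
for i in range(25):
    r = b - A @ x                    # Eq. 17.104: residual (negative gradient)
    if np.linalg.norm(r) < 1e-10:    # converged
        break
    alpha = (r @ r) / (r @ (A @ r))  # Eq. 17.105: exact line-search step size
    x = x + alpha * r                # Eq. 17.106: move along the residual

print(x, np.linalg.solve(A, b))      # the two results should agree closely
```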
The above derivation was presented for a quadratic programming problem (or, equivalently, the solution of a linear system). Notwithstanding, CG has been extended for handling nonlinear problems, e.g., the Fletcher-Reeves, Hestenes-Stiefel, and Dai-Yuan methods [176]. However, it is noted that CG has rarely been used for large-scale deep learning problems. The nonlinear versions of CG require a line search to identify the lowest point along the fastest-descent direction. This causes difficulties in applying CG to nonlinear problems, such as additional computational effort, but that may not be the primary obstacle. One major concern lies in the use of batches for the line search: using the whole batch leads to high computing demands and, more importantly, forces the optimizer to converge to the nearest local minimum, which is not desired. Notwithstanding, such issues may be addressed in the future. Thus, CG is still introduced here considering its potential.

17.3.4. Expectation Maximization Methods

As mentioned above, the Expectation Maximization (EM) methods have a much broader definition. For the same reason, it is hard to define a general approximate function for them as we did for gradient-based and Newton's methods. Also, the formulation and implementation of this type of method can differ considerably, depending on the applications and algorithms. Thus, the introduction to EM in this subsection is divided into two parts. First, we will work through the math underneath one typical example to show what an approximate function could look like. Next, we will go through one simple example to gain a more intuitive understanding and first-hand experience of the implementation of the basic EM method.
First, to peek into the approximate function, let us set up an example using cross entropy as the objective function.
\begin{equation*}
\ell=-\sum_{x, y} \tilde{p}(x, y) \log p(x, y) \tag{17.108}
\end{equation*}
where $x$ and $y$ are two state variables, $p$ is the model probability, and $\tilde{p}$ is the observed (empirical) estimate of $p$.
In a typical classification problem, let us use $z$ to represent the class and then apply the chain rule of probability. We can obtain
\begin{equation*}
\ell=-\sum_{x, y} \tilde{p}(x, y) \log \sum_{z} p(x \mid z) p(z \mid y) p(y) \tag{17.109}
\end{equation*}
Since $p(y)$ is a fixed number (determined by the data), the above objective function can be simplified by removing this term, giving
\begin{equation*}
\ell=-\sum_{x, y} \tilde{p}(x, y) \log \sum_{z} p(x \mid z) p(z \mid y) \tag{17.110}
\end{equation*}
The following gradients can be obtained:
\begin{align*}
\frac{\partial \ell}{\partial p(x \mid z)} & =-\sum_{y} \frac{\tilde{p}(x, y) p(z \mid y)}{\sum_{z} p(x \mid z) p(z \mid y)} \tag{17.111}\\
\frac{\partial \ell}{\partial p(z \mid y)} & =-\sum_{x} \frac{\tilde{p}(x, y) p(x \mid z)}{\sum_{z} p(x \mid z) p(z \mid y)} \tag{17.112}
\end{align*}
The application of gradient-based methods in this case is not easy because of the extra constraints: $\sum_{x} p(x \mid z)=1$, $\sum_{z} p(z \mid y)=1$, $p(x \mid z) \geqslant 0$, and $p(z \mid y) \geqslant 0$. The position of the log (outside the sum over $z$) in the above loss function also causes difficulty. Thus, we introduce the following approximate function:
\begin{equation*}
g=-\sum_{x, y} \tilde{p}(x, y) \sum_{z}\left[C_{x, y, z} \log (p(x \mid z) p(z \mid y))\right] \tag{17.113}
\end{equation*}
The gradients of this approximate objective function are
\begin{align*}
\frac{\partial g}{\partial p(x \mid z)} & =-\sum_{y} \frac{\tilde{p}(x, y) C_{x, y, z}}{p(x \mid z)} \tag{17.114}\\
\frac{\partial g}{\partial p(z \mid y)} & =-\sum_{x} \frac{\tilde{p}(x, y) C_{x, y, z}}{p(z \mid y)} \tag{17.115}
\end{align*}
Comparing the gradients of this approximate function with those of the original function, i.e., $\nabla \ell$ vs. $\nabla g$, it is not difficult to obtain
\begin{equation*}
C_{x, y, z}=\frac{p(x \mid z) p(z \mid y)}{\sum_{z}[p(x \mid z) p(z \mid y)]} \tag{17.116}
\end{equation*}
Substituting the above equation into the $g$ function will generate an approximate function for this specific example. As can be seen, this is much different from those in the gradient-based methods and Newton's method. Other approximate functions can be derived for other EM applications.
The above derivation is provided to show the similarity between EM and other optimization methods. However, implementations of EM do not need the above derivation and differ from the previous optimization methods. The key to most EM implementations is to construct and/or utilize hidden parameters. An iterative process that alternately updates the major optimization parameters and the hidden parameters leads to the optimal parameters and the model they define. Taking the above derivation as an example, we first calculate $C_{x, y, z}$ using the current $p(x \mid z)$ and $p(z \mid y)$ (the E-step); then $C_{x, y, z}$ is used to update $p(x \mid z)$ and $p(z \mid y)$ (the M-step), as sketched below.
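A minimal NumPy sketch of this alternating update is given below, assuming $x$, $y$, and $z$ are discrete with small cardinalities and that $\tilde{p}(x, y)$ is an empirical joint distribution. All array names, sizes, and the random initialization are illustrative assumptions; the M-step shown normalizes the weighted counts, which is one way to minimize $g$ under the normalization constraints.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_y, n_z = 4, 3, 2

p_tilde = rng.random((n_x, n_y))
p_tilde /= p_tilde.sum()                    # empirical p~(x, y)

p_x_given_z = rng.random((n_x, n_z))
p_x_given_z /= p_x_given_z.sum(axis=0)      # each column sums to 1 over x
p_z_given_y = rng.random((n_z, n_y))
p_z_given_y /= p_z_given_y.sum(axis=0)      # each column sums to 1 over z

for _ in range(50):
    # E-step: C_{x,y,z} = p(x|z) p(z|y) / sum_z p(x|z) p(z|y)   (Eq. 17.116)
    joint = np.einsum('xz,zy->xyz', p_x_given_z, p_z_given_y)
    C = joint / joint.sum(axis=2, keepdims=True)

    # M-step: update p(x|z) and p(z|y) from the weighted counts p~(x,y) C_{x,y,z}
    w = p_tilde[:, :, None] * C
    p_x_given_z = w.sum(axis=1)                              # shape (x, z)
    p_x_given_z /= p_x_given_z.sum(axis=0, keepdims=True)
    p_z_given_y = w.sum(axis=0).T                            # shape (z, y)
    p_z_given_y /= p_z_given_y.sum(axis=0, keepdims=True)
```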
The above derivation was based on a meaningful yet abstract example. To gain a more intuitive understanding, let us consider one real-world example that is easy to understand.
In this example, two coins were tossed. In each round, one of the two coins was selected and tossed ten times, and the numbers of heads and tails were recorded. Now, we will show how to estimate the probability of getting each side for each coin and to guess which coin was used in each round.
Table 17.1: Results of coin toss

| Round | Coin #A | Coin #B |
| :--- | :---: | :---: |
| 1 | - | 5 Heads, 5 Tails |
| 2 | 9 Heads, 1 Tail | - |
| 3 | 8 Heads, 2 Tails | - |
| 4 | - | 4 Heads, 6 Tails |
| 5 | 7 Heads, 3 Tails | - |
Table 17.1 shows the results of the test. Please be aware that which coin was selected for each round of toss is unknown when solving the problem. However, this fact is recorded in the above table for validation purposes.
This problem can be solved using EM. The key is to identify the hidden parameter(s). Before that, we need to figure out the model to be sought, or in other words, the parameters to be optimized. These parameters are the probabilities of getting each side for each coin. Let us use $P_{A}$ and $P_{B}$ to represent the probabilities of obtaining heads with the two coins; accordingly, the probabilities of getting tails are $1-P_{A}$ and $1-P_{B}$. From the problem description, we know there are other unknowns, namely which coin was tossed in each of the five rounds. Intuitively, we can use these unknowns as the hidden parameters. Let us use $\left[\theta_{1}, \theta_{2}, \theta_{3}, \theta_{4}, \theta_{5}\right]$ to represent them, e.g., $\theta_{1}$ is the probability that Coin A was selected in the first round, and likewise, $1-\theta_{1}$ is the probability that Coin B was selected in the first round.
Let us start with the initial guesses $P_{A}=0.6$ and $P_{B}=0.5$.

Iteration 1 ($P_{A}=0.6$, $P_{B}=0.5$):
Round 1: $\theta_{1}=\frac{0.6^{5} \cdot 0.4^{5}}{0.6^{5} \cdot 0.4^{5}+0.5^{5} \cdot 0.5^{5}} \approx 0.45$

For Coin A, the expected numbers of heads and tails are $5 \times 0.45=2.2$ and $5 \times 0.45=2.2$.

For Coin B, the expected numbers of heads and tails are $5 \times(1-0.45)=2.8$ and $5 \times(1-0.45)=2.8$.

Round 2: $\theta_{2}=\frac{0.6^{9} \cdot 0.4^{1}}{0.6^{9} \cdot 0.4^{1}+0.5^{9} \cdot 0.5^{1}} \approx 0.8$

For Coin A, the expected numbers of heads and tails are $9 \times 0.8=7.2$ and $1 \times 0.8=0.8$.

For Coin B, the expected numbers of heads and tails are $9 \times(1-0.8)=1.8$ and $1 \times(1-0.8)=0.2$.
Finishing all five rounds of the first iteration, as shown in Table 17.2, we get 21.3 heads and 8.6 tails for Coin A and 11.7 heads and 8.4 tails for Coin B. Then we can use the following two equations to update $P_{A}$ and $P_{B}$, which are needed for the next iteration.
\begin{align*}
& P_{A}=\frac{21.3}{21.3+8.6}=0.71 \tag{17.117}\\
& P_{B}=\frac{11.7}{11.7+8.4}=0.58 \tag{17.118}
\end{align*}
The above iterative process can be repeated until convergence, approaching the model that best describes the data. This process illustrates a simple EM implementation of maximum likelihood estimation.
Table 17.2: Results of the EM learning process

| Round | Coin A | Coin B |
| :--- | :---: | :---: |
| 1 | 2.2 Heads, 2.2 Tails | 2.8 Heads, 2.8 Tails |
| 2 | 7.2 Heads, 0.8 Tails | 1.8 Heads, 0.2 Tails |
| 3 | 5.9 Heads, 1.5 Tails | 2.1 Heads, 0.5 Tails |
| 4 | 1.4 Heads, 2.1 Tails | 2.6 Heads, 3.9 Tails |
| 5 | 4.5 Heads, 1.9 Tails | 2.5 Heads, 1.1 Tails |
| Total | 21.3 Heads, 8.6 Tails | 11.7 Heads, 8.4 Tails |
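For first-hand experience, the following is a minimal sketch of the full EM loop for this coin example. The number of iterations and the implicit equal prior over the two coins are illustrative assumptions; after the first iteration it reproduces the values in Eqs. 17.117-17.118 and, approximately, Table 17.2.

```python
# Observed tosses per round: (heads, tails); which coin was used is hidden.
rounds = [(5, 5), (9, 1), (8, 2), (4, 6), (7, 3)]

p_a, p_b = 0.6, 0.5  # initial guesses for P(heads) of Coin A and Coin B

for iteration in range(10):
    heads_a = tails_a = heads_b = tails_b = 0.0
    for h, t in rounds:
        # E-step: posterior probability (theta) that Coin A produced this round,
        # assuming both coins were a priori equally likely to be selected.
        like_a = p_a ** h * (1 - p_a) ** t
        like_b = p_b ** h * (1 - p_b) ** t
        theta = like_a / (like_a + like_b)
        # Accumulate the expected head/tail counts attributed to each coin.
        heads_a += theta * h
        tails_a += theta * t
        heads_b += (1 - theta) * h
        tails_b += (1 - theta) * t
    # M-step: re-estimate the head probabilities (cf. Eqs. 17.117-17.118).
    p_a = heads_a / (heads_a + tails_a)
    p_b = heads_b / (heads_b + tails_b)
    print(iteration + 1, round(p_a, 2), round(p_b, 2))
```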